
Estimate partition memory usage based on previous attempts #11857

Merged
merged 5 commits into trinodb:master on Apr 9, 2022

Conversation

losipiuk
Member

@losipiuk losipiuk commented Apr 7, 2022

Description

Estimate partition memory usage based on previous attempts.
This applies to execution with task-level retries when the bin-packing node allocator is selected.

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core engine

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Apr 7, 2022
@losipiuk losipiuk requested review from arhimondr and linzebing April 7, 2022 20:43
@losipiuk
Member Author

losipiuk commented Apr 7, 2022

@arhimondr do we need RN for this?

Contributor

@arhimondr arhimondr left a comment


LGTM % comments

@losipiuk losipiuk force-pushed the lo/smarter-partition-sizing branch from bd5e8b0 to 835f201 on April 7, 2022 22:36
@losipiuk
Member Author

losipiuk commented Apr 7, 2022

Rebased on top of #11861

@losipiuk losipiuk force-pushed the lo/smarter-partition-sizing branch from 835f201 to f7ea0b7 on April 7, 2022 22:57
@losipiuk
Member Author

losipiuk commented Apr 7, 2022

AC

@github-actions github-actions bot added the docs label Apr 7, 2022
package io.trino.execution.scheduler;

@FunctionalInterface
public interface PartitionMemoryEstimatorFactory
Member

QQ: why is having a factory beneficial here?

Member Author

We want a new instance for each stage, so we can base estimates on the tasks that completed for that particular stage.
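
For illustration, here is a minimal sketch of the factory-per-stage idea described in the reply above. The interface and method names follow the quoted snippet, but the exact shape (the placeholder PartitionMemoryEstimator and the no-argument create method) is an assumption for this sketch, not necessarily the real Trino API:

package io.trino.execution.scheduler;

// Sketch: placeholder for the per-stage estimator produced by the factory.
interface PartitionMemoryEstimator
{
    // e.g. record finished/failed partitions and estimate memory for new ones
}

@FunctionalInterface
interface PartitionMemoryEstimatorFactory
{
    // Invoked once per stage, so the returned estimator accumulates statistics
    // only from tasks that ran for that particular stage.
    PartitionMemoryEstimator createPartitionMemoryEstimator();
}

With this shape, each stage's scheduler calls the factory once and keeps its own estimator instance, so memory statistics never mix across stages.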

// take previousRequiredBytes into account when registering failure on oom. It is conservative hence safer (and in-line with getNextRetryMemoryRequirements)
long previousRequiredBytes = previousMemoryRequirements.getRequiredMemory().toBytes();
long previousPeakBytes = peakMemoryUsage.toBytes();
memoryUsageDistribution.add(Math.max(previousRequiredBytes, previousPeakBytes) * growthFactor);
Member

Why would we add estimated memory usage to the distribution? If the retry succeeds, the actual usage will be added anyway, right? This seems to skew the metric.

I understand your intention, though: if one task consumes a large amount of memory, then other tasks may also need a large amount of memory. But this makes the stats collection less accurate; maybe we should explore some other approach instead.

Member Author

Yeah - this is surely not an exact science and I am not sure how well it will work in practice. But the intention is exactly what you wrote: if we see that tasks are dying because we gave them too little memory, we want to bump the initial memory for new tasks right away, rather than wait until one succeeds (it may take a long time before we have one).

At first I was thinking about keeping two separate histograms for successful and unsuccessful tries, and making the one for unsuccessful tries decay over time so that newer data matters more - but I did not come up with a reasonable way to merge the data from both, so I implemented the simple (though, I agree, not 100% bullet-proof) approach.

Happy to hear suggestions on how to improve it though :)

BTW: I will add a commit on top with extra debug logging so we can see how it works in practice when testing queries on a cluster.
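
For illustration, here is a minimal, self-contained sketch of the scheme discussed in this thread. The class and method names are hypothetical, and the growth factor and percentile constants are assumptions for the sketch; only the handling of failed attempts mirrors the quoted snippet above:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class MemoryUsageEstimatorSketch
{
    private static final double GROWTH_FACTOR = 2.0;       // assumed value, configurable in practice
    private static final double ESTIMATE_PERCENTILE = 0.9; // assumed value, configurable in practice

    // Observed (or projected) per-task memory usage for one stage.
    private final List<Long> memoryUsageDistribution = new ArrayList<>();

    // Successful attempt: record the actual peak memory usage.
    void recordSuccess(long peakMemoryBytes)
    {
        memoryUsageDistribution.add(peakMemoryBytes);
    }

    // Failed (OOM) attempt: record what the retry will be asked to reserve, i.e. the
    // larger of the previous requirement and the observed peak, scaled by the growth
    // factor. This biases future initial estimates upward before any retry succeeds.
    void recordOomFailure(long previousRequiredBytes, long peakMemoryBytes)
    {
        memoryUsageDistribution.add((long) (Math.max(previousRequiredBytes, peakMemoryBytes) * GROWTH_FACTOR));
    }

    // Initial estimate for a new task: a high percentile of everything seen so far,
    // falling back to a default when there is no data yet.
    long estimateInitialMemory(long defaultBytes)
    {
        if (memoryUsageDistribution.isEmpty()) {
            return defaultBytes;
        }
        List<Long> sorted = new ArrayList<>(memoryUsageDistribution);
        Collections.sort(sorted);
        int index = (int) Math.ceil(ESTIMATE_PERCENTILE * sorted.size()) - 1;
        return sorted.get(Math.max(index, 0));
    }
}

The trade-off raised above is visible here: recordOomFailure pushes a projected value into the same distribution as real measurements, so initial estimates for new tasks rise quickly after failures, at the cost of some skew in the collected statistics.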

Member

Hmmm. Actually I guess with some tuning your approach might work well in practice. We can leave it as it is for now.

@losipiuk losipiuk force-pushed the lo/smarter-partition-sizing branch from f7ea0b7 to 6ea2208 on April 8, 2022 13:04
@losipiuk losipiuk force-pushed the lo/smarter-partition-sizing branch from 6ea2208 to 57457cf on April 8, 2022 17:24
@losipiuk losipiuk merged commit 6fba6a4 into trinodb:master Apr 9, 2022
@github-actions github-actions bot added this to the 377 milestone Apr 9, 2022